Add stats service #207

Merged
ehinman merged 15 commits into DOI-USGS:main from ehinman:add-stats-service
Feb 24, 2026

ehinman (Collaborator) commented Dec 30, 2025

Adds two functions that query the two endpoints documented at https://api.waterdata.usgs.gov/statistics/v0/docs

Also adds utility functions for parsing and organizing the JSON response. The Water Data API functions could be further edited to include the stats API functions, but for now I kept them separate.

To do (1/8/26):

  • Add percentile values for min/max/median to match R dataretrieval
  • Add unit tests
  • Add examples

@jzemmels jzemmels closed this Jan 9, 2026
@jzemmels jzemmels reopened this Jan 9, 2026
Co-authored-by: Joe Zemmels (he/him) <jzemmels@gmail.com>
@jzemmels jzemmels self-requested a review February 5, 2026 18:39
@ehinman ehinman marked this pull request as ready for review February 20, 2026 22:51
ehinman (Collaborator, Author) commented Feb 20, 2026

@jzemmels @jeffskwang-usgs this PR is ready for your review. It includes the two stats API endpoints and should mirror how the functions work in R. I'm prioritizing getting the functions in so that they can be used in the current conditions pipeline, but eventually a vignette on how to use them would be nice, too. That can be a separate PR.

Thanks for your feedback!

jzemmels (Collaborator) left a comment

Nice work! Here's a summary of my review:

  • Ran examples in documentation, comparing against what's returned by dR. All looks good
  • Ran the unit tests locally, everything passed
  • We've been using the por function extensively already in the current conditions pipeline, which I think is evidence enough that the functions work as intended.

I think the only substantive differences between these and the dR functions are the naming convention por_stats vs. stats_por and your inclusion of the expand_percentiles argument. The API output can be a bit confusing depending on the exact settings of computation_type. I'm not sure if there's a precedent from other endpoints for dealing with nested data, so mentioning the subtleties somewhere might be helpful.

measured and the units of measure. A complete list of parameter codes
and associated groupings can be found at
https://help.waterdata.usgs.gov/codes-and-parameters/parameters.
expand_percentiles : boolean

Collaborator

It may be helpful to also mention that setting expand_percentiles = False and requesting 'percentiles' and one of ['median', 'minimum', 'maximum', 'arithmetic_mean'] will return both a value and a values column, whereas expand_percentiles = True will consolidate these columns into a single value column. Requesting just 'percentiles' with expand_percentiles = False will return just a values column. There's probably a simpler way to describe this than how I've said.
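
For anyone following this thread, the column shapes described above can be mimicked with a small pandas sketch. The rows are hand-made rather than API output, the column names computation and percentiles and the helper flatten_percentiles are assumptions made for illustration, and this is not the package's implementation of expand_percentiles.

```python
import pandas as pd

# Hand-made rows imitating the nested shape described in the comment above.
# Not real API output; "computation" and "percentiles" are assumed column names.
nested = pd.DataFrame(
    {
        "computation": ["median", "percentile"],
        "value": [12.0, None],                # scalar statistics
        "values": [None, [5.0, 10.0, 20.0]],  # list-valued percentile results
        "percentiles": [None, [10, 50, 90]],  # thresholds matching "values"
    }
)


def flatten_percentiles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per statistic, with a single value column."""
    # Wrap scalar statistics in one-element lists so every row explodes alike.
    value = [
        vals if isinstance(vals, list) else [val]
        for val, vals in zip(df["value"], df["values"])
    ]
    percentile = [
        pcts if isinstance(pcts, list) else [float("nan")]
        for pcts in df["percentiles"]
    ]
    out = pd.DataFrame(
        {
            "computation": df["computation"].tolist(),
            "value": pd.Series(value, dtype=object),
            "percentile": pd.Series(percentile, dtype=object),
        }
    )
    # Explode both list columns together so values stay paired with thresholds.
    out = out.explode(["value", "percentile"], ignore_index=True)
    return out.astype({"value": float, "percentile": float})


flat = flatten_percentiles(nested)
print(flat)
```

The nested frame corresponds to the value plus values case, and flat corresponds to the single value column case. Rows that were scalar statistics carry NaN in the assumed percentile column.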

Collaborator Author

Good idea, I have added some information about this.

Collaborator

Looks good! I wasn't saying you should change the function names to match dR, just that they were different.

The read_stats_por and read_stats_daterange naming convention was to make it easier for tab-completion (i.e., someone types read_stats then tab to see the two options appear).

Collaborator Author

I think it's a good change. The same sort of thing can be applied in Python. It's nice to be consistent. Now to deal with this sudden Ubuntu failure, ugh.

jeffskwang-usgs commented Feb 23, 2026

Hi @ehinman, thanks for including me on this. I've looked over the code, but I'd also like to run the unit tests. I'm unfamiliar with testing python packages, so what's the best way to go about that?

ehinman (Collaborator, Author) commented Feb 23, 2026

> Hi @ehinman, thanks for including me on this. I've looked over the code, but I'd also like to run the unit tests. I'm unfamiliar with testing python packages, so what's the best way to go about that?

Thanks Jeffrey! Let's see, you'll want to make sure you have the branch version of dataretrieval-python installed in your environment, plus its dependencies, plus pytest. Then, you should be able to open your terminal, make sure it's in the correct repo, and run pytest; it'll find the "tests" folder and run those tests. It prints a short report of the PASS/FAIL status of each test.

jzemmels (Collaborator) commented

> Hi @ehinman, thanks for including me on this. I've looked over the code, but I'd also like to run the unit tests. I'm unfamiliar with testing python packages, so what's the best way to go about that?

I probably should figure out how to use pytest, but I just ran the test examples and manually checked that the assert statements were true.

jeffskwang-usgs commented Feb 24, 2026

Ok, I was having a little difficulty installing things correctly to run pytest. I used pixi to build an environment from the included pyproject.toml. I needed geopandas to run the tests, and ran into issues building the environment because of a pandas version conflict:

dataretrieval-python % pixi add geopandas
Error:   × failed to solve the pypi requirements of environment 'default' for platform 'osx-arm64'
  ├─▶ failed to resolve pypi dependencies
  ╰─▶ Because you require pandas>=2.0.0,<3.0.0 and pandas==3.0.1, we can conclude that your requirements are unsatisfiable.
  help: The following PyPI packages have been pinned by the conda solve, and this version may be causing a conflict:
        pandas==3.0.1
        See https://pixi.sh/latest/concepts/conda_pypi/#pinned-package-conflicts for more information.

I had to remove the <3.0.0 constraint from pandas in the pyproject.toml file to get it to build. After that, I ran the tests. I believe all the newly added tests for the stats service passed.

dataretrieval-python % pytest -vv tests/waterdata_test.py    
========================================================================================================== test session starts ===========================================================================================================
platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/.pixi/envs/default/bin/python3.14
cachedir: .pytest_cache
rootdir: /Users/jkwang/Desktop/data-ret-test/dataretrieval-python
configfile: pyproject.toml
collected 25 items                                                                                                                                                                                                                       

tests/waterdata_test.py::test_mock_get_samples ERROR                                                                                                                                                                               [  4%]
tests/waterdata_test.py::test_check_profiles PASSED                                                                                                                                                                                [  8%]
tests/waterdata_test.py::test_samples_results PASSED                                                                                                                                                                               [ 12%]
tests/waterdata_test.py::test_samples_activity PASSED                                                                                                                                                                              [ 16%]
tests/waterdata_test.py::test_samples_locations PASSED                                                                                                                                                                             [ 20%]
tests/waterdata_test.py::test_samples_projects PASSED                                                                                                                                                                              [ 24%]
tests/waterdata_test.py::test_samples_organizations PASSED                                                                                                                                                                         [ 28%]
tests/waterdata_test.py::test_get_daily PASSED                                                                                                                                                                                     [ 32%]
tests/waterdata_test.py::test_get_daily_properties PASSED                                                                                                                                                                          [ 36%]
tests/waterdata_test.py::test_get_daily_properties_id PASSED                                                                                                                                                                       [ 40%]
tests/waterdata_test.py::test_get_daily_no_geometry PASSED                                                                                                                                                                         [ 44%]
tests/waterdata_test.py::test_get_continuous FAILED                                                                                                                                                                                [ 48%]
tests/waterdata_test.py::test_get_monitoring_locations PASSED                                                                                                                                                                      [ 52%]
tests/waterdata_test.py::test_get_monitoring_locations_hucs PASSED                                                                                                                                                                 [ 56%]
tests/waterdata_test.py::test_get_latest_continuous FAILED                                                                                                                                                                         [ 60%]
tests/waterdata_test.py::test_get_latest_daily PASSED                                                                                                                                                                              [ 64%]
tests/waterdata_test.py::test_get_latest_daily_properties_geometry PASSED                                                                                                                                                          [ 68%]
tests/waterdata_test.py::test_get_field_measurements PASSED                                                                                                                                                                        [ 72%]
tests/waterdata_test.py::test_get_time_series_metadata PASSED                                                                                                                                                                      [ 76%]
tests/waterdata_test.py::test_get_reference_table PASSED                                                                                                                                                                           [ 80%]
tests/waterdata_test.py::test_get_reference_table_with_query PASSED                                                                                                                                                                [ 84%]
tests/waterdata_test.py::test_get_reference_table_wrong_name PASSED                                                                                                                                                                [ 88%]
tests/waterdata_test.py::test_get_por_stats PASSED                                                                                                                                                                                 [ 92%]
tests/waterdata_test.py::test_get_por_stats_expanded_false PASSED                                                                                                                                                                  [ 96%]
tests/waterdata_test.py::test_get_date_range_stats PASSED                                                                                                                                                                          [100%]

================================================================================================================= ERRORS =================================================================================================================
________________________________________________________________________________________________ ERROR at setup of test_mock_get_samples _________________________________________________________________________________________________
file /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/tests/waterdata_test.py, line 31
  def test_mock_get_samples(requests_mock):
E       fixture 'requests_mock' not found
>       available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, capteesys, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, subtests, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/Users/jkwang/Desktop/data-ret-test/dataretrieval-python/tests/waterdata_test.py:31
================================================================================================================ FAILURES ================================================================================================================
__________________________________________________________________________________________________________ test_get_continuous ___________________________________________________________________________________________________________

    def test_get_continuous():
        df,_ = get_continuous(
            monitoring_location_id="USGS-06904500",
            parameter_code="00065",
            time="2025-01-01/2025-12-31"
        )
        assert isinstance(df, DataFrame)
        assert "geometry" not in df.columns
        assert df.shape[1] == 11
>       assert df['time'].dtype == 'datetime64[ns, UTC]'
E       AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
E        +  where datetime64[us, UTC] = 0       2025-01-01 00:00:00+00:00\n1       2025-01-01 00:15:00+00:00\n2       2025-01-01 00:30:00+00:00\n3       2025-01-01 00:45:00+00:00\n4       2025-01-01 01:00:00+00:00\n                   ...           \n34525   2025-12-30 23:00:00+00:00\n34526   2025-12-30 23:15:00+00:00\n34527   2025-12-30 23:30:00+00:00\n34528   2025-12-30 23:45:00+00:00\n34529   2025-12-31 00:00:00+00:00\nName: time, Length: 34530, dtype: datetime64[us, UTC].dtype

tests/waterdata_test.py:179: AssertionError
_______________________________________________________________________________________________________ test_get_latest_continuous _______________________________________________________________________________________________________

    def test_get_latest_continuous():
        df, md = get_latest_continuous(
            monitoring_location_id=["USGS-05427718", "USGS-05427719"],
            parameter_code=["00060", "00065"]
        )
        assert "latest_continuous_id" == df.columns[-1]
        assert df.shape[0] <= 4
        assert df.statistic_id.unique().tolist() == ["00011"]
        assert hasattr(md, 'url')
        assert hasattr(md, 'query_time')
>       assert df['time'].dtype == 'datetime64[ns, UTC]'
E       AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
E        +  where datetime64[us, UTC] = 0   2026-02-24 14:00:00+00:00\n1   2026-02-24 14:00:00+00:00\nName: time, dtype: datetime64[us, UTC].dtype

tests/waterdata_test.py:207: AssertionError
============================================================================================================ warnings summary ============================================================================================================
dataretrieval/__init__.py:9
  /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/dataretrieval/__init__.py:9: DeprecationWarning: The 'nwis' services are deprecated and being decommissioned. Please use the 'waterdata' module to access the new services.
    from dataretrieval.nwis import *

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== short test summary info =========================================================================================================
FAILED tests/waterdata_test.py::test_get_continuous - AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
 +  where datetime64[us, UTC] = 0       2025-01-01 00:00:00+00:00\n1       2025-01-01 00:15:00+00:00\n2       2025-01-01 00:30:00+00:00\n3       2025-01-01 00:45:00+00:00\n4       2025-01-01 01:00:00+00:00\n                   ...           \n34525   2025-12-30 23:00:00+00:00\n34526   2025-12-30 23:15:00+00:00\n34527   2025-12-30 23:30:00+00:00\n34528   2025-12-30 23:45:00+00:00\n34529   2025-12-31 00:00:00+00:00\nName: time, Length: 34530, dtype: datetime64[us, UTC].dtype
FAILED tests/waterdata_test.py::test_get_latest_continuous - AssertionError: assert datetime64[us, UTC] == 'datetime64[ns, UTC]'
 +  where datetime64[us, UTC] = 0   2026-02-24 14:00:00+00:00\n1   2026-02-24 14:00:00+00:00\nName: time, dtype: datetime64[us, UTC].dtype
ERROR tests/waterdata_test.py::test_mock_get_samples
=========================================================================================== 2 failed, 22 passed, 1 warning, 1 error in 16.10s ============================================================================================

ehinman (Collaborator, Author) commented Feb 24, 2026

@jeffskwang-usgs, thanks for running these tests on your machine! I believe the first error occurs because you do not have all the modules needed to run the tests installed, namely requests-mock. Is that correct? At any rate, it is not used by the unit tests added in this PR. I believe the other two failures are related to differences between pandas 2.x.x and pandas 3.x.x: the latter reports a different datetime resolution (us instead of ns). Tim and I have discussed bumping up the dependency to include 3.x.x, but decided that would be its own separate PR.
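
The ns versus us mismatch can be illustrated without calling the API. What follows is only a sketch of how a dtype check could be written so that it does not depend on the datetime resolution; it is not a change made in this PR, the helper is_utc_datetime is a hypothetical name, and the timestamps are made up.

```python
import pandas as pd

# Made-up timestamps standing in for the "time" column checked by the tests.
times = pd.to_datetime(["2025-01-01T00:00:00Z", "2025-01-01T00:15:00Z"], utc=True)
df = pd.DataFrame({"time": times})


def is_utc_datetime(series: pd.Series) -> bool:
    """Hypothetical helper: True for a tz-aware UTC datetime column of any resolution."""
    dtype = series.dtype
    # Comparing against the literal 'datetime64[ns, UTC]' ties a test to one
    # resolution; checking the dtype class and time zone does not.
    return isinstance(dtype, pd.DatetimeTZDtype) and str(dtype.tz) == "UTC"


# The resolution (for example "ns" or "us") depends on the installed pandas version.
print(df["time"].dtype.unit)
print(is_utc_datetime(df["time"]))
```

A check like this would accept both datetime64[ns, UTC] and datetime64[us, UTC], at the cost of no longer asserting a specific resolution.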

jeffskwang-usgs commented

That's right, I was able to get that part to pass after using pixi to install it:

dataretrieval-python % pixi add pytest requests-mock
 WARN The package `pytest-cov==7.0.0` does not have an extra named `all`
✔ Added pytest >=9.0.2,<10
✔ Added requests-mock >=1.12.1,<2
dataretrieval-python % pytest -vv tests/waterdata_test.py                                   
===================================================================================== test session starts ======================================================================================
platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/jkwang/Desktop/data-ret-test/dataretrieval-python/.pixi/envs/default/bin/python3.14
cachedir: .pytest_cache
rootdir: /Users/jkwang/Desktop/data-ret-test/dataretrieval-python
configfile: pyproject.toml
plugins: requests-mock-1.12.1
collected 25 items                                                                                                                                                                             

tests/waterdata_test.py::test_mock_get_samples PASSED   

@ehinman ehinman merged commit 4dc9f6a into DOI-USGS:main Feb 24, 2026
7 of 13 checks passed